GB-CENT: Gradient Boosted Categorical Embedding and Numerical Trees

Authors

  • Qian Zhao
  • Yue Shi
  • Liangjie Hong

Abstract

Latent factor models and decision tree based models are widely used for prediction, ranking, and recommendation. Latent factor models have the advantage of interpreting categorical features through a low-dimensional representation, but such an interpretation does not naturally fit numerical features. In contrast, decision tree based models capture the nonlinear interactions of numerical features well, but their ability to handle categorical features is limited by the cardinality of those features. Since real-world applications usually have both abundant numerical features and categorical features with large cardinality (e.g., geolocations, IDs, and tags), we design a new model, called GB-CENT, which combines latent factor embedding and tree components to achieve the merits of both while avoiding their demerits. On two real-world data sets, we demonstrate that GB-CENT is fast and achieves better accuracy than state-of-the-art matrix factorization, decision tree based models, and their ensemble.
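The general idea of pairing a learned representation for categorical features with trees for numerical features can be illustrated with a very small sketch. This is not the GB-CENT algorithm itself, whose details the abstract does not give. It assumes squared loss, uses a single learned scalar offset per category value as the simplest stand-in for an embedding, and then boosts depth-1 regression trees (stumps) on one numerical feature against the residuals. All function names, parameters, and the synthetic data are hypothetical.

```python
# Illustrative sketch only: NOT the GB-CENT algorithm from the paper.
# Stage 1: one learned scalar offset per category value (a 1-D stand-in for an embedding).
# Stage 2: gradient-boosted regression stumps on a numerical feature, fit to residuals.

def fit_category_offsets(cats, targets, epochs=50, lr=0.1):
    """Learn a global bias plus one offset per category by SGD on squared loss."""
    bias = sum(targets) / len(targets)
    offsets = {c: 0.0 for c in set(cats)}
    for _ in range(epochs):
        for c, y in zip(cats, targets):
            offsets[c] += lr * (y - bias - offsets[c])
    return bias, offsets

def fit_stump(xs, residuals):
    """Pick the threshold on xs that minimizes the squared error of the residuals."""
    mean_all = sum(residuals) / len(residuals)
    best = (float("inf"), min(xs), mean_all, mean_all)  # fallback if xs is constant
    for t in sorted(set(xs))[:-1]:  # every distinct value except the largest
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]  # (threshold, left value, right value)

def fit_model(cats, xs, targets, n_trees=20, shrinkage=0.5):
    """Fit the categorical stage first, then boost stumps on what it leaves unexplained."""
    bias, offsets = fit_category_offsets(cats, targets)
    preds = [bias + offsets[c] for c in cats]
    stumps = []
    for _ in range(n_trees):
        # Under squared loss the residuals are the negative gradients.
        residuals = [y - p for y, p in zip(targets, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + shrinkage * (lm if x <= t else rm) for p, x in zip(preds, xs)]
    return {"bias": bias, "offsets": offsets, "stumps": stumps, "shrinkage": shrinkage}

def predict(model, c, x):
    """Sum the categorical part and the shrunken stump outputs."""
    out = model["bias"] + model["offsets"].get(c, 0.0)
    for t, lm, rm in model["stumps"]:
        out += model["shrinkage"] * (lm if x <= t else rm)
    return out

def mse(model, cats, xs, targets):
    errs = [(y - predict(model, c, x)) ** 2 for c, x, y in zip(cats, xs, targets)]
    return sum(errs) / len(errs)

# Synthetic data (hypothetical): a per-category effect plus a step in the numerical feature.
effects = {"a": 1.0, "b": -1.0, "c": 0.5}
cats = [c for c in effects for _ in range(10)]
xs = [x for _ in effects for x in range(10)]
targets = [effects[c] + (2.0 if x > 4 else 0.0) for c, x in zip(cats, xs)]

model = fit_model(cats, xs, targets)
cat_only_mse = mse(dict(model, stumps=[]), cats, xs, targets)
full_mse = mse(model, cats, xs, targets)
```

On this toy data the categorical stage alone cannot represent the step in the numerical feature, so comparing `cat_only_mse` with `full_mse` shows what the tree stage adds. A real system would use multi-dimensional embeddings and deeper trees over many features.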

Similar resources

Optimization with Gradient-Boosted Trees and Risk Control

Decision trees effectively represent the sparse, high dimensional and noisy nature of chemical data from experiments. Having learned a function from this data, we may want to thereafter optimize the function, e.g., picking the best chemical process catalyst. In this way, we may repurpose legacy predictive models. This work studies a large-scale, industrially-relevant mixed-integer quadratic opt...

Boosted trees for ecological modeling and prediction.

Accurate prediction and explanation are fundamental objectives of statistical analysis, yet they seldom coincide. Boosted trees are a statistical learning method that attains both of these objectives for regression and classification analyses. They can deal with many types of response variables (numeric, categorical, and censored), loss functions (Gaussian, binomial, Poisson, and robust), and p...

TF Boosted Trees: A Scalable TensorFlow Based Framework for Gradient Boosting

TF Boosted Trees (TFBT) is a new open-sourced framework for the distributed training of gradient boosted trees. It is based on TensorFlow, and its distinguishing features include a novel architecture, automatic loss differentiation, layer-by-layer boosting that results in smaller ensembles and faster prediction, principled multi-class handling, and a number of regularization techniques to preve...

A one-dimensional model for variations of longitudinal wave velocity under different thermal conditions

Ultrasonic testing is a versatile and important nondestructive testing method. In many industrial applications, ultrasonic testing is carried out at relatively high temperatures. Since the ultrasonic w...

Multi-field Categorical Data

This paper presents a method of learning distributed representation for multi-field categorical data, which is a common data format with various applications such as recommender systems, social link prediction, and computational advertising. The success of non-linear models, e.g., factorisation machines, boosted trees, has proved the potential of exploring the interactions among inter-field cat...

Publication date: 2017